Combining learning approaches for incremental on-line parsing

Authors

  • Deryle W. Lonsdale
  • Michael Manookin
Abstract

This paper discusses the integration of two different machine learning approaches to modeling language: NL-Soar and analogical modeling (AM). The resulting hybrid system is capable of functionality that is not possible when using either system in isolation. After a brief introduction of each system, we explain how AM is used to provide information useful to NL-Soar for two tasks. Examples are given, and related issues are outlined.

Introduction

Ongoing investigation in computational language modeling involves assessing the relative strengths and weaknesses of various approaches, for example symbolic versus subsymbolic, or rule-based versus exemplar-based. In this paper we show how an exemplar-based method can provide two types of crucial information that might otherwise be unavailable to a symbolic cognitive modeling system. In the following section we sketch two substantially different machine learning approaches to language modeling. Next, we mention two well-studied natural language learning tasks: named entity recognition and prepositional phrase attachment resolution. The subsequent section discusses how the two systems, NL-Soar and analogical modeling, have been combined in a way that brings together their relative strengths in novel and interesting ways involving these two tasks. In the last section we present conclusions, observations, and ideas for future work.

Language modeling: symbolic and exemplar-based

We begin by discussing two heretofore unrelated systems that have traditionally been used to model different language-use phenomena: NL-Soar and analogical modeling (AM). Their complementarity motivates this integration: the former provides cognitive-level control, and the latter gives robust low-level instance-based matching.

Natural-language Soar

Natural-language Soar (NL-Soar) is an agent-based, hierarchical, goal-directed machine learning system based on the Soar cognitive modeling approach [Newell, 1990].
It has been used to model language use in a variety of modalities (comprehension [Lonsdale and Rytting, 2001], generation [Lonsdale, 2000], discourse [Green and Lehman, 2002]) in a variety of communicative task settings. A rule-based system, its basic knowledge repository is a set of if-then productions. Probabilistic reasoning is not a core feature of the basic architecture, and this introduces various challenges when addressing language-related tasks (among others). The system receives lexical input word by word, and lexical access is performed for each word in turn. During lexical access WordNet [Fellbaum, 1998] provides relevant morphological, syntactic, and semantic information for all of the senses and homographs of the word in question [Rytting and Lonsdale, 2001]. The system then attempts to integrate the incoming words incrementally into linguistic models: a syntactic X-bar parse tree and a semantic lexical-conceptual structure. All potential syntactic and semantic material is considered in piecing together licit constructions. Constraints operate to rule out attachments that do not follow standard principles. In certain cases, some types of limited structure can be undone and reformulated when ongoing hypotheses prove untenable in the presence of new incoming words. The semantic conceptual primitives are based on WordNet's lexical filenames and senses, which constitute (respectively) coarse-grained and fine-grained categories such as v-body for body verbs (e.g. sneezed, tripped) and n-plant for rhododendron [1]. Given the high degree of lexically-based ambiguity in English, much processing in NL-Soar involves determining compatibility between words and phrases at the syntactic and semantic levels. The system, which is symbolic in its functionality, relies on a set of hand-coded rules and on-line learning as it performs the task.
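To make the lexical-access step concrete, the following sketch uses a toy stand-in lexicon (the entries and sense names here are hypothetical; the real system queries WordNet itself) to show the kind of sense and category information NL-Soar receives for each incoming word:

```python
# Toy stand-in for WordNet-based lexical access (hypothetical mini-lexicon;
# NL-Soar itself queries WordNet). Each word maps to (sense, category) pairs
# using the paper's abbreviated category names (v-body, n-plant, ...).
LEXICON = {
    "sneezed":      [("sneeze.v.01", "v-body")],
    "rhododendron": [("rhododendron.n.01", "n-plant")],
    "washington":   [("washington.n.01", "n-group"),
                     ("washington.n.02", "n-location"),
                     ("washington.n.03", "n-person")],
}

def lexical_access(word):
    """Return every sense/homograph of a word, or [] if it is absent."""
    return LEXICON.get(word.lower(), [])

# An ambiguous word yields several candidate categories, all of which the
# parser then tries to integrate incrementally.
candidates = lexical_access("Washington")
```

An ambiguous word like "Washington" returns three candidate categories; words absent from the lexicon return nothing, which is exactly the gap discussed under proper-noun semantics below.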
Its limited backtracking abilities allow modeling such processes as local ambiguity resolution, syntactic garden pathing, and complexity-induced breakdown in human parsing performance [Lewis, 1993].

[1] In reality the WordNet codes are verb.body and noun.plant, but this paper uses abbreviated names as shown.

Analogical modeling

Analogical modeling (AM) is a data-driven, exemplar-based approach to modeling language [Skousen, 1989] and other types of data. It has no rule-based component, either explicit or implicit, requires no explicit knowledge representations beyond the set of exemplars, and is a flexible and robust language modeling paradigm. Several linguistic applications using analogical modeling as the basic approach have been reported, involving phonology, morphology, word sense disambiguation, speech processing, and lexical selection. So far it has only modeled low-level individual tasks, precluding its use as a comprehensive modeling framework. The system operates as follows. A set of exemplars that address and illustrate a particular linguistic phenomenon is prepared; each instance has a fixed-length feature-vector encoding that represents salient (and perhaps nonsalient or questionable) properties of that instance. Each instance is labelled with an outcome that the system uses to report how that instance behaves with respect to the phenomenon in question. At run time, the user inputs to the system a set of queries in the form of similarly encoded feature vectors. The system matches each input query against the exemplar base and generates one or more probabilistically weighted outcomes. The system is able to tolerate noisy or incomplete data, and behaves differently from other approaches in that it takes into consideration so-called "gang effects" that are problematic for more traditional machine-learning language-modeling methods.
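The run-time matching just described can be illustrated with a deliberately simplified sketch. This is not Skousen's actual algorithm, which derives outcome probabilities from homogeneous supracontexts; here outcomes are weighted only by raw feature overlap between query and exemplar:

```python
from collections import Counter

def predict(exemplars, query):
    """Return a probabilistically weighted outcome distribution for a query.

    exemplars: list of (feature_vector, outcome) pairs; all vectors share
    the same fixed length, as in AM's exemplar encoding.
    """
    weights = Counter()
    for features, outcome in exemplars:
        # Weight each exemplar's outcome by how many features it shares
        # with the query (a crude stand-in for AM's analogical set).
        weights[outcome] += sum(f == q for f, q in zip(features, query))
    total = sum(weights.values())
    return {outcome: w / total for outcome, w in weights.items()}

exemplars = [
    (("a", "b", "c"), "x"),
    (("a", "b", "d"), "x"),
    (("e", "f", "d"), "y"),
]
dist = predict(exemplars, ("a", "b", "d"))  # "x" dominates: 5/6 vs 1/6
```

Even this crude overlap measure shows the gang effect in miniature: two partially matching "x" exemplars jointly outweigh a single competing "y" exemplar.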
More details are available elsewhere concerning the system's application to language [Skousen, 1989], its statistical foundations and processing metrics [Skousen, 1992], NLP applications [Jones, 1996], and recent comparative work [Skousen et al., 2002].

Relevant tasks

In developing and scaling up the NL-Soar system, several issues arose that were not solvable using traditional symbolic methods. In this section we survey two such problems and how solutions were achieved from recent research in the natural language learning community.

Proper-noun semantics

Proper nouns are a crucial component of natural language, but previous versions of NL-Soar did not address them. Using WordNet as the primary lexical resource allows for the retrieval of some of the more common proper nouns contained therein, such as "Virginia" and "France". For such words WordNet also provides semantic information; for example, different senses of "Washington" are encoded as n-group, n-location, and n-person. Of course, only a small fraction of proper nouns are included in WordNet, which is problematic for NL-Soar's processing. A simple approach for syntax is to assume that any non-sentence-initial capitalized word encountered in text and absent from WordNet acts syntactically as a proper noun [2]. Following this assumption has been straightforward and successful for handling the syntax of proper nouns. On the other hand, determining the semantics of proper nouns not included in WordNet has been more problematic. Fortunately, this so-called named entity recognition (NER) problem has recently undergone extensive study by the natural language learning community [Tjong Kim Sang and De Meulder, 2003]. Though it has not been discussed in previous conferences on the topic, AM, like other modeling approaches, has been successfully used for named entity recognition.
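As a sketch of how a single NER exemplar for AM might be encoded (the feature choices here, i.e. the token, its part-of-speech tag, and its chunk tag, with an IOB named-entity tag as the outcome, follow the CoNLL shared-task data format; the function itself is ours for illustration):

```python
# Hypothetical encoding of one NER exemplar in the CoNLL style: the token,
# its part-of-speech tag, and its shallow-parse chunk tag form the
# fixed-length feature vector; the IOB named-entity tag is the outcome.
def encode_token(word, pos_tag, chunk_tag, ne_tag):
    """Return one (feature_vector, outcome) exemplar for AM."""
    return ((word.lower(), pos_tag, chunk_tag), ne_tag)

# "Virginia" as a location, inside a noun phrase:
exemplar = encode_token("Virginia", "NNP", "I-NP", "B-LOC")
# Features of neighboring tokens can be appended in the same fixed-length
# scheme to give the matcher local context.
```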
Using standardized data sets released from previous CoNLL shared tasks [3], AM researchers have been able to achieve state-of-the-art results for English, Dutch, and Spanish. Exemplar vector encodings used such features as the lexical item itself, its part-of-speech information, shallow-parsed constituent information, and the standard IOB codes for semantic classification. Providing NER data for the semantics of noun phrases has been one motivation for integrating NL-Soar and AM processing. Before discussing and exemplifying this integration, we first mention another ideal application where a hybrid approach is advantageous.

[2] For the purposes of this paper we do not discuss words like "eBay", or "and" in multi-word proper-noun expressions.
[3] See http://lcg-www.uia.ac.be/conll2002/ner and http://lcg-www.uia.ac.be/conll2003/ner.

PP attachment

The prepositional phrase attachment (PP-attachment) problem is an important and widely studied issue in natural language processing. Determining syntactic PP-attachment is problematic even for humans, as different attachment sites lead to multiple semantic interpretations (e.g. "I saw the man with the telescope."). Psycholinguistic research shows that human strategies for resolving PP-attachment ambiguities include the use of argument relations in the sentence [Schelstraete, 1996, Schuetze and Gibson, 1999], prosodic cues [Schafer, 1998, Straub, 1998], lexical constraints [Boland and Boehm-Jernigan, 1998], and context [Ferstl, 1994]. Others [Spivey-Knowlton and Sedivy, 1995] have demonstrated that lexical bias and contextual information have a strong effect. NL-Soar, as a rule-based symbolic system, has traditionally inferred PP-attachments primarily using lexical subcategorization information (i.e. WordNet verb frames). Leveraging the complement/adjunct distinction is thus based largely on data provided by WordNet.
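PP attachment likewise fits the exemplar mold: the standard formulation classifies the quadruple (verb, object noun, preposition, PP object) as verb attachment ('V') or noun attachment ('N'). The following toy sketch uses invented exemplars (a real exemplar base would be extracted from a treebank) and scores by feature overlap as a simplified stand-in for AM proper:

```python
from collections import Counter

# Toy PP-attachment exemplars over (verb, object noun, preposition,
# PP object) quadruples; outcome 'V' = attach to verb, 'N' = attach to noun.
EXEMPLARS = [
    (("saw", "man", "with", "telescope"), "V"),
    (("saw", "star", "with", "telescope"), "V"),
    (("saw", "man", "with", "beard"), "N"),
    (("met", "man", "with", "beard"), "N"),
]

def attach(query):
    """Pick the attachment outcome whose exemplars best overlap the query."""
    weights = Counter()
    for features, outcome in EXEMPLARS:
        weights[outcome] += sum(f == q for f, q in zip(features, query))
    return weights.most_common(1)[0][0]

# The PP object, not the verb's subcategorization, drives the decision:
attach(("saw", "woman", "with", "telescope"))  # 'V'
attach(("saw", "woman", "with", "beard"))      # 'N'
```

Note that both queries share the verb "saw", yet receive different attachments; this is precisely the class of decisions that subcategorization information alone cannot make.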
Figure 1 reflects parses of two sentences: (a) "The minister warned the president of the danger.", and (b)-(c) "The minister warned the president of the republic of the danger." During parsing of the latter sentence, "...of the republic" is first temporarily linked as the PP-complement of "warned" (cf. "...of the danger" in (a)). When the second preposition is encountered, though, NL-Soar removes "...of the republic" from the verb's complement position and remakes the structure by adjoining the PP to the noun "president". The second "of" is then linked in as the PP-complement of "warned" (b). The parse then completes (c).

Though the system has been capable of handling relatively complex constructions like the one just discussed, a large class of PP-attachment scenarios could not be processed by the system. In particular, problems arose when the attachment decision was determined not by subcategorization information on the matrix verb, but rather by lexical semantic information contained in the oblique object of the preposition. This leads to familiar structural ambiguities, which are sometimes ambiguous even for humans: "I saw the man with a beard/telescope." Here attachment is determined by the PP object ("beard" or "telescope"), not by the subcategorization of "saw". Hence subcategorization information alone is insufficient to make PP-attachment decisions in many contexts; another approach was needed by NL-Soar to deal with ambiguities of this type. The next section presents the solution to this problem. Past approaches to PP-attachment disambiguation have focused on statistical [Hindle and Rooth, 1993, Ratnaparkhi et al., 1994, Collins and Brooks, 1995] or rule-based [Brill and Resnik, 1994] methods. The statistical




Publication date: 2004